Statistical and Machine Learning

Lab3: Classification
Logistic regression and k-nearest neighbors

Tsai, Dai-Rong

Dataset

Pima Indians Diabetes Database

# Set random seed
set.seed(999)

# Packages
library(MASS) # for stepAIC
library(glmnet) # for glmnet, cv.glmnet
library(caret) # for train, trainControl

# Data
data(PimaIndiansDiabetes, package = "mlbench")
  • Response
    • diabetes: test for diabetes (neg / pos)
  • Predictors
    • pregnant: Number of times pregnant
    • glucose: Plasma glucose concentration (glucose tolerance test)
    • pressure: Diastolic blood pressure (mmHg)
    • triceps: Triceps skin fold thickness (mm)
    • insulin: 2-hour serum insulin (\(\mu\)U/ml)
    • mass: Body mass index (weight in kg/\(\text{(height in m)}^2\))
    • pedigree: Diabetes pedigree function
    • age: Age (years)
dim(PimaIndiansDiabetes)
[1] 768   9
head(PimaIndiansDiabetes)
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg
proportions(table(PimaIndiansDiabetes$diabetes))

    neg     pos 
0.65104 0.34896 
str(PimaIndiansDiabetes)
'data.frame':   768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
 $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
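Note that several predictors contain zeros that are physiologically implausible (e.g. triceps = 0 and insulin = 0 in the rows above); in this dataset they conventionally stand in for missing values. A quick count, as a sanity check:

```r
# Load the data (as above) and count zeros in the columns where
# zero is not a plausible measurement
data(PimaIndiansDiabetes, package = "mlbench")
zero.counts <- colSums(PimaIndiansDiabetes[, c("glucose", "pressure",
                                               "triceps", "insulin",
                                               "mass")] == 0)
zero.counts
```

Whether to impute or drop these rows is left open here; the lab fits all models on the data as-is.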

Create Training/Testing Partitions

  • Split data into 80% training set and 20% test set
nr <- nrow(PimaIndiansDiabetes)
train.id <- sample(nr, nr * 0.8)

training <- PimaIndiansDiabetes[train.id, ]
testing <- PimaIndiansDiabetes[-train.id, ]
  • Check dimension
dim(training)
[1] 614   9
dim(testing)
[1] 154   9
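sample() draws rows uniformly, so the neg/pos proportions in the two partitions can drift slightly from the overall 65/35 split. If an exactly stratified split is preferred, caret's createDataPartition() samples within each class; a sketch under the same 80/20 ratio:

```r
library(caret)
data(PimaIndiansDiabetes, package = "mlbench")
set.seed(999)
# createDataPartition() samples within each level of the outcome,
# so both partitions keep (approximately) the same neg/pos mix
idx <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.8, list = FALSE)
training.s <- PimaIndiansDiabetes[idx, ]
testing.s  <- PimaIndiansDiabetes[-idx, ]
proportions(table(training.s$diabetes))
proportions(table(testing.s$diabetes))
```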

Subset Selection

glm.full <- glm(diabetes ~ ., family = binomial, data = training)

Arguments

  • family: a description of the error distribution and link function to be used in the model.
    • binomial(link = "logit")
    • gaussian(link = "identity")
    • poisson(link = "log")
    • quasipoisson(link = "log")
    • See ?family for more family functions.
  • Auxiliary for Controlling GLM Fitting
    • epsilon = 1e-8: positive convergence tolerance \(\epsilon\); the iterations converge when \(\frac{|dev - dev_{old}|}{|dev| + 0.1} < \epsilon\).
    • maxit = 25: maximal number of IWLS iterations.
    • trace = FALSE: logical indicating if output should be produced for each iteration.
  • Stepwise Selection
glm.step <- stepAIC(glm.full, scope = list(lower = ~ 1), direction = "both", trace = 0)
glm.step$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
diabetes ~ pregnant + glucose + pressure + triceps + insulin + 
    mass + pedigree + age

Final Model:
diabetes ~ pregnant + glucose + pressure + mass + pedigree

       Step Df Deviance Resid. Df Resid. Dev    AIC
1                             605     579.85 597.85
2 - triceps  1 0.058537       606     579.91 595.91
3 - insulin  1 0.231721       607     580.14 594.14
4     - age  1 0.309372       608     580.45 592.45
summary(glm.step)

Call:
glm(formula = diabetes ~ pregnant + glucose + pressure + mass + 
    pedigree, family = binomial, data = training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.18825    0.76804  -10.66  < 2e-16 ***
pregnant     0.15893    0.03090    5.14  2.7e-07 ***
glucose      0.03621    0.00385    9.41  < 2e-16 ***
pressure    -0.01269    0.00553   -2.29  0.02182 *  
mass         0.08477    0.01556    5.45  5.1e-08 ***
pedigree     1.16449    0.33827    3.44  0.00058 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 805.74  on 613  degrees of freedom
Residual deviance: 580.45  on 608  degrees of freedom
AIC: 592.5

Number of Fisher Scoring iterations: 5
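Since the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to report; a sketch using Wald intervals:

```r
# Odds ratios and 95% Wald confidence intervals for the stepwise model;
# e.g. each additional pregnancy multiplies the odds of a positive test
# by exp(coef), holding the other predictors fixed
or.tab <- exp(cbind(OR = coef(glm.step), confint.default(glm.step)))
round(or.tab, 3)
```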
  • Prediction

Arguments

  • type
    • "link" (default) : scale of the linear predictors. \(log(\frac{P(Y=1)}{1-P(Y=1)}) = X\beta\)
    • "response": scale of the response variable. \(P(Y=1) = \frac{exp(X\beta)}{1+exp(X\beta)}\)
  • se.fit: logical switch indicating if standard errors are required.
predict(glm.step, newdata = head(testing), type = "link")
       1        3       12       20       37       39 
 0.78802  1.65395  2.39064 -1.20540  0.89490 -1.65133 
predict(glm.step, newdata = head(testing), type = "response")
      1       3      12      20      37      39 
0.68741 0.83942 0.91611 0.23052 0.70990 0.16093 
predict(glm.step, newdata = head(testing), type = "response", se.fit = TRUE)
$fit
      1       3      12      20      37      39 
0.68741 0.83942 0.91611 0.23052 0.70990 0.16093 

$se.fit
       1        3       12       20       37       39 
0.033992 0.043821 0.021990 0.026913 0.048066 0.027057 

$residual.scale
[1] 1
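The se.fit values on the response scale come from a delta-method approximation, so Wald intervals built from them can fall outside (0, 1). A common alternative (a sketch, using the usual 1.96 normal quantile) is to form the interval on the link scale and back-transform with plogis():

```r
# 95% CI on the linear-predictor scale, then mapped through the
# inverse logit; bounds are guaranteed to stay inside (0, 1)
pr    <- predict(glm.step, newdata = head(testing), type = "link", se.fit = TRUE)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)
upper <- plogis(pr$fit + 1.96 * pr$se.fit)
cbind(lower, upper)
```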

Regularization

  • Construct Design Matrix
x <- model.matrix(diabetes ~ ., data = training)[, -1]
y <- training$diabetes
  • "lpsa ~ .": Create a design matrix for all variables except for lpsa.
  • "[, -1]": Exclude the column of intercepts. (The intercept term is fitted by default in glmnet)

Ridge / Lasso Regression

ridge <- glmnet(x = x, y = y, family = "binomial", alpha = 0) # Ridge
lasso <- glmnet(x = x, y = y, family = "binomial", alpha = 1) # Lasso

Arguments

  • family:
    • One of the built-in families: "gaussian"(default), "binomial", "poisson", "multinomial", "cox", "mgaussian".
    • A glm() family object. (See ?family)

Multinomial logistic regression

  • family = "multinomial"
  • type.multinomial = "grouped" : a grouped lasso penalty is used on the multinomial coefficients for a variable to ensure they are all together. The default is "ungrouped".
  • Code for the plots
par(mfrow = c(2, 2))
plot(ridge, xvar = "lambda", label = TRUE)
plot(ridge, xvar = "norm", label = TRUE)
plot(lasso, xvar = "lambda", label = TRUE)
plot(lasso, xvar = "norm", label = TRUE)
title("Ridge Regression", line = -2, outer = TRUE)
title("Lasso Regression", line = -22, outer = TRUE)

  • Coefficients of Ridge
coef(ridge, s = exp(seq(-4, 4, 2)))
9 x 5 sparse Matrix of class "dgCMatrix"
                s1     s2     s3     s4     s5
(Intercept) -7e+00 -5e+00 -2e+00 -8e-01 -6e-01
pregnant     1e-01  7e-02  2e-02  4e-03  5e-04
glucose      3e-02  2e-02  5e-03  9e-04  1e-04
pressure    -9e-03 -3e-03  4e-04  2e-04  3e-05
triceps      2e-03  2e-03  2e-03  4e-04  6e-05
insulin      4e-05  6e-04  4e-04  9e-05  1e-05
mass         7e-02  4e-02  1e-02  2e-03  3e-04
pedigree     9e-01  6e-01  2e-01  3e-02  5e-03
age          1e-02  1e-02  6e-03  1e-03  2e-04
  • Coefficients of Lasso
coef(lasso, s = exp(seq(-6, -2, 1)))
9 x 5 sparse Matrix of class "dgCMatrix"
                s1     s2     s3    s4    s5
(Intercept) -8e+00 -7.712 -6.938 -5.10 -2.16
pregnant     1e-01  0.132  0.107  0.05  .   
glucose      4e-02  0.034  0.031  0.02  0.01
pressure    -1e-02 -0.009 -0.003  .     .   
triceps     .       .      .      .     .   
insulin     -2e-04  .      .      .     .   
mass         8e-02  0.075  0.060  0.04  .   
pedigree     1e+00  0.975  0.713  0.19  .   
age          4e-03  0.003  .      .     .   

Cross-Validation

  • CV of Ridge
ridge.cv <- cv.glmnet(x = x, y = y, alpha = 0,
                      family = "binomial",
                      type.measure = "deviance",
                      nfolds = 10)
ridge.cv

Call:  cv.glmnet(x = x, y = y, type.measure = "deviance", nfolds = 10,      alpha = 0, family = "binomial") 

Measure: Binomial Deviance 

    Lambda Index Measure     SE Nonzero
min 0.0229   100   0.977 0.0480       8
1se 0.1223    82   1.021 0.0368       8
plot(ridge.cv)

coef(ridge.cv, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) -7.1333e+00
pregnant     1.1527e-01
glucose      2.9295e-02
pressure    -9.2739e-03
triceps      1.7856e-03
insulin      3.7511e-05
mass         6.9554e-02
pedigree     9.2749e-01
age          9.6757e-03
  • CV of Lasso
lasso.cv <- cv.glmnet(x = x, y = y, alpha = 1,
                      family = "binomial",
                      type.measure = "deviance",
                      nfolds = 10)
lasso.cv

Call:  cv.glmnet(x = x, y = y, type.measure = "deviance", nfolds = 10,      alpha = 1, family = "binomial") 

Measure: Binomial Deviance 

    Lambda Index Measure     SE Nonzero
min 0.0073    38   0.972 0.0456       6
1se 0.0430    19   1.013 0.0275       4
plot(lasso.cv)

coef(lasso.cv, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept) -7.6665655
pregnant     0.1310709
glucose      0.0334065
pressure    -0.0085941
triceps      .        
insulin      .        
mass         0.0743105
pedigree     0.9603272
age          0.0030311

Arguments

  • type.measure: loss to use for cross-validation.
    • "default" : MSE for gaussian models, deviance for logistic and poisson regressions, and partial-likelihood for the Cox model.
    • "mse"/"mae": Mean squared/absolute error for all models except the “Cox”.
    • "deviance": Deviance for logistic and poisson regressions.
    • "class": Misclassification error for binomial and multinomial logistic regressions.
    • "auc": Area under the ROC curve for two-class logistic regression.
    • "C": Harrel’s concordance measure for Cox models.
  • s: Value(s) of the penalty parameter \(\lambda\).
    • "lambda.1se" (default): Largest value of \(\lambda\) such that error is within 1 standard error of the minimum.
    • "lambda.min": Value of \(\lambda\) that gives the minimum mean cross-validated error.
    • numeric vector: Value(s) of \(\lambda\) to be used
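To see how the two default choices of s differ in practice, one can predict test-set classes at both values; "lambda.1se" keeps fewer predictors (4 vs. 6 above), so the two rules can disagree on borderline cases. A sketch:

```r
# Class predictions from the cross-validated lasso at both standard
# lambda choices; the table shows where the two rules disagree
z.test <- model.matrix(diabetes ~ ., data = testing)[, -1]
p.min  <- predict(lasso.cv, newx = z.test, s = "lambda.min", type = "class")
p.1se  <- predict(lasso.cv, newx = z.test, s = "lambda.1se", type = "class")
table(lambda.min = p.min, lambda.1se = p.1se)
```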

Elastic-Net

  • CV of Elastic-Net
cv10 <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     # Evaluate performance using sensitivity, specificity, AUC
                     summaryFunction = twoClassSummary, classProbs = TRUE)

Arguments

  • method: the resampling method.
    • "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV"
    • "LGOCV": for repeated training/test splits
    • "none": only fits one model to the entire training set
    • "oob": for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models
    • "timeslice", "adaptive_cv", "adaptive_boot", "adaptive_LGOCV"
  • summaryFunction: a function to compute performance metrics across resamples.
    • twoClassSummary: sensitivity, specificity, area under the ROC curve.
    • prSummary: precision, recall, area under the precision-recall curve.
    • multiClassSummary: computes some overall measures of performance and several averages of statistics calculated from "one-versus-all" configurations.
elnet.tune <- train(
  diabetes ~ .,
  data = training,
  method = "glmnet",
  trControl = cv10,
  tuneLength = 10,
  # Specify which metric to optimize
  metric = "ROC"
)

elnet.tune
glmnet 

614 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 553, 553, 553, 552, 553, 552, ... 
Resampling results across tuning parameters:

  alpha  lambda      ROC      Sens     Spec     
  0.1    0.00024468  0.83332  0.87077  0.5971542
  0.1    0.00056524  0.83332  0.87077  0.5971542
  0.1    0.00130578  0.83337  0.87077  0.5971542
  0.1    0.00301653  0.83326  0.87128  0.5953755
  0.1    0.00696858  0.83366  0.87179  0.5900791
  0.1    0.01609832  0.83347  0.87590  0.5748221
  0.1    0.03718921  0.83234  0.88872  0.5505534
  0.1    0.08591190  0.83097  0.90872  0.5156522
  0.1    0.19846764  0.82851  0.93487  0.4469960
  0.2    0.00024468  0.83348  0.87077  0.5971542
  0.2    0.00056524  0.83348  0.87077  0.5971542
  0.2    0.00130578  0.83348  0.87077  0.5971542
  0.2    0.00301653  0.83328  0.87128  0.5944664
  0.2    0.00696858  0.83363  0.87179  0.5874308
  0.2    0.01609832  0.83370  0.87795  0.5703953
  0.2    0.03718921  0.83290  0.89282  0.5477866
  0.2    0.08591190  0.83147  0.91641  0.5182213
  0.2    0.19846764  0.83031  0.95077  0.4264822
  0.3    0.00024468  0.83348  0.87077  0.5971542
  0.3    0.00056524  0.83348  0.87077  0.5971542
  0.3    0.00130578  0.83353  0.87077  0.5971542
  0.3    0.00301653  0.83353  0.87077  0.5935968
  0.3    0.00696858  0.83399  0.87333  0.5865217
  0.3    0.01609832  0.83399  0.87692  0.5676680
  0.3    0.03718921  0.83336  0.89641  0.5406719
  0.3    0.08591190  0.83222  0.92154  0.5092490
  0.3    0.19846764  0.82511  0.96359  0.3575889
  0.4    0.00024468  0.83355  0.87077  0.5971542
  0.4    0.00056524  0.83355  0.87077  0.5971542
  0.4    0.00130578  0.83360  0.87077  0.5971542
  0.4    0.00301653  0.83383  0.87077  0.5935968
  0.4    0.00696858  0.83408  0.87333  0.5865217
  0.4    0.01609832  0.83412  0.87949  0.5677075
  0.4    0.03718921  0.83313  0.89692  0.5397233
  0.4    0.08591190  0.83212  0.92462  0.5083399
  0.4    0.19846764  0.81307  0.97128  0.3023320
  0.5    0.00024468  0.83369  0.87077  0.5971542
  0.5    0.00056524  0.83369  0.87077  0.5971542
  0.5    0.00130578  0.83364  0.87128  0.5953755
  0.5    0.00301653  0.83390  0.87077  0.5927273
  0.5    0.00696858  0.83397  0.87385  0.5856522
  0.5    0.01609832  0.83450  0.88308  0.5650198
  0.5    0.03718921  0.83312  0.89692  0.5343083
  0.5    0.08591190  0.83050  0.92513  0.5057708
  0.5    0.19846764  0.80200  0.97436  0.2534387
  0.6    0.00024468  0.83378  0.87077  0.5962846
  0.6    0.00056524  0.83378  0.87077  0.5962846
  0.6    0.00130578  0.83362  0.87128  0.5953755
  0.6    0.00301653  0.83403  0.87179  0.5918577
  0.6    0.00696858  0.83399  0.87538  0.5856522
  0.6    0.01609832  0.83500  0.88462  0.5640711
  0.6    0.03718921  0.83339  0.89692  0.5316601
  0.6    0.08591190  0.82736  0.92359  0.5030435
  0.6    0.19846764  0.79124  0.97795  0.2035573
  0.7    0.00024468  0.83376  0.87077  0.5962846
  0.7    0.00056524  0.83376  0.87077  0.5962846
  0.7    0.00130578  0.83359  0.87128  0.5953755
  0.7    0.00301653  0.83415  0.87179  0.5918577
  0.7    0.00696858  0.83444  0.87487  0.5829644
  0.7    0.01609832  0.83519  0.88615  0.5613439
  0.7    0.03718921  0.83392  0.89692  0.5316996
  0.7    0.08591190  0.82206  0.92564  0.4781028
  0.7    0.19846764  0.78866  0.98718  0.1294862
  0.8    0.00024468  0.83369  0.87077  0.5962846
  0.8    0.00056524  0.83369  0.87077  0.5962846
  0.8    0.00130578  0.83366  0.87128  0.5953755
  0.8    0.00301653  0.83426  0.87128  0.5909881
  0.8    0.00696858  0.83469  0.87538  0.5784585
  0.8    0.01609832  0.83505  0.88667  0.5569170
  0.8    0.03718921  0.83355  0.89744  0.5307510
  0.8    0.08591190  0.81532  0.92769  0.4638735
  0.8    0.19846764  0.78850  0.99538  0.0490514
  0.9    0.00024468  0.83373  0.87128  0.5962846
  0.9    0.00056524  0.83373  0.87128  0.5962846
  0.9    0.00130578  0.83382  0.87077  0.5945059
  0.9    0.00301653  0.83442  0.87282  0.5900791
  0.9    0.00696858  0.83478  0.87590  0.5775494
  0.9    0.01609832  0.83546  0.88821  0.5541502
  0.9    0.03718921  0.83320  0.89897  0.5271937
  0.9    0.08591190  0.80958  0.93179  0.4504743
  0.9    0.19846764  0.78850  1.00000  0.0027273
  1.0    0.00024468  0.83373  0.87128  0.5962846
  1.0    0.00056524  0.83373  0.87128  0.5962846
  1.0    0.00130578  0.83392  0.87026  0.5945059
  1.0    0.00301653  0.83442  0.87282  0.5892095
  1.0    0.00696858  0.83515  0.87692  0.5775099
  1.0    0.01609832  0.83496  0.88821  0.5541107
  1.0    0.03718921  0.83204  0.89897  0.5227273
  1.0    0.08591190  0.80302  0.93590  0.4264822
  1.0    0.19846764  0.78850  1.00000  0.0000000

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.9 and lambda = 0.016098.

Plots

plot(elnet.tune)

plot(elnet.tune, plotType = "level")

  • Final Model
(elnet.optim <- elnet.tune$bestTune)
   alpha   lambda
78   0.9 0.016098
elnet <- glmnet(x = x, y = y, family = "binomial",
                alpha = elnet.optim$alpha,
                lambda = elnet.optim$lambda)
coef(elnet)
9 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept) -7.09642400
pregnant     0.11353076
glucose      0.03111434
pressure    -0.00439644
triceps      .         
insulin      .         
mass         0.06384842
pedigree     0.78410701
age          0.00099008

k-nearest neighbors (k-NN)

  • CV of k-NN
knn.tune.1 <- train(
  diabetes ~ .,
  data = training,
  method = "knn",
  trControl = cv10,
  tuneGrid = data.frame(k = seq(3, 31, 2)),
  metric = "ROC"
)

knn.tune.2 <- train(
  diabetes ~ .,
  data = training,
  method = "knn",
  preProcess = c("center", "scale"),
  trControl = cv10,
  tuneGrid = data.frame(k = seq(3, 31, 2)),
  metric = "ROC"
)
knn.tune.1$bestTune
   k
6 13
plot(knn.tune.1)

knn.tune.2$bestTune
    k
15 31
plot(knn.tune.2)
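The gap between the two tuned models comes from the predictors' very different scales: without preProcess = c("center", "scale"), the Euclidean distances used by k-NN are dominated by the large-variance predictors. A quick check of the raw standard deviations:

```r
# Standard deviations of the raw predictors; insulin and glucose are
# orders of magnitude larger than pedigree, so unscaled distances
# effectively ignore the small-variance variables
sds <- round(sapply(training[, -9], sd), 2)
sds
```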

Prediction

Logistic regression with stepwise selection

  • See ?predict.glm
  • type: “link”, “response”, “terms”
glm.prob <- predict(glm.step, testing, type = "response")
pred.step <- ifelse(glm.prob >= 0.5, "pos", "neg")

Logistic regression with regularization

  • See ?predict.glmnet
  • type: “link”, “response”, “coefficients”, “nonzero”, “class”
z <- model.matrix(diabetes ~ ., data = testing)[, -1]
pred.elnet <- predict(elnet, newx = z, type = "class")

k-NN

  • See ?predict.train
  • type: “raw”, “prob”
pred.knn.1 <- predict(knn.tune.1, testing)
pred.knn.2 <- predict(knn.tune.2, testing)
  • Accuracy
acc <- sapply(mget(ls(pattern = "^pred")), \(x) mean(x == testing$diabetes))
sort(acc)
pred.knn.1 pred.knn.2 pred.elnet  pred.step 
   0.75974    0.76623    0.77273    0.77273
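Accuracy alone hides the sensitivity/specificity trade-off that mattered during tuning. caret's confusionMatrix() reports both, plus Cohen's kappa; a sketch for the stepwise model, treating "pos" as the event of interest:

```r
# Cross-tabulate predicted vs. observed classes for the stepwise model
cm <- confusionMatrix(factor(pred.step, levels = levels(testing$diabetes)),
                      testing$diabetes, positive = "pos")
cm
```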